COVID-19 Data Analysis


Purpose

I got off a call with a F1 Simulator Software Architect the other day. It was during our talk that Pandas was mentioned for processing data. I never heard of Pandas before, so I did my research after the call. I went to one of my full-time friends in the Vision Automation team at Tesla and asked him what is Pandas. He made the joke that it is like endangered species and then proceeded to tell me the actual software -- Pandas. He said it was a library in python that is used a lot to process dataset in Excel with a bunch of data. I was like SOLD! Anything to make my life easier and more learning! I had prior experience in python, Excel, and C# from UT Austin and Tesla so I was excited. I also want to code in Jupyter Notebook as well so used this project for it.

Thought Process

Overview

  • Research on YouTube about Pandas
  • Apply the knowledge by trying an example
  •   One of the Hackathon that I did recently we were using a big dataset for our product. We used Kaggle (a website that is free with one of the world's largest data sets).

Process Step

  1. Found a dataset that I was interested in analyzing
  2.   https://www.kaggle.com/datasets/meirnizri/covid19-dataset?resource=download
  3. Install Jupyter Notebook: have to open terminal on the Jupyter Notebook location
  4. Install python (already have -- upgraded it)
  5. Load dataframes for python to manipulate in Pandas
  6.   Fix the indexing of the Excel with the Column A in Excel as 0 (zero)
  7. Read the csv data: rearranging the filtering the data
  8. Data charts: install matplotlib using pip install matplotlib
  9.   Made a bar chart for statistical analysis of the count of people with pre-exisiting' health conditions vs no health conditions.
Challenges
  • Plot not showing up
    1. Needed to execute all the cells -- not just the plot code
    2. Made sure that there isn't another code cell that is causing an issue -- restarted the kernel and ran the cell again
    3. Isolate the plot code in a new Jupyter Notebook cell -- to isolate any interference from other cells.
  • Had new_dfP = (df.iloc[:,7:17] == 1) | (df.iloc[:,7:17] == 2) to filter the original dataframe, but should be a mask for selected rows that has values of either 1 "yes" or 2 "no".
    1. Incorrect code: plt.bar(new_dfP.index, countY, label = 'Health Conditions', width = 0.4)
    2. Fixed code: plt.bar(df.index, countY, label='Health Conditions', width=0.4)
  • Debug by countY = df.iloc[:, 7:17].eq(1).sum(axis=1); print(countY) found out that the countY was adding a column of the total number of yes pre-existing health conditions
  • Needed the countY to be the total number of yes/no for the bar chart
  • Added the totalCountY = countY.sum(); print(totalCountY)
  •   Fixed the plt.bar(df.index, totalCountY, label='Health Conditions', width=0.4); plt.bar(df.index, totalCountN, bottom=totalCountY, label='No Health Conditions', width=0.4) figured that the "bottom" parameter isn't needed so we can plot two bars side by side.

    Skills Learned & Tips

    • Filtering the data learned the difference from loc vs. iloc
    •   Loc is label indexing and can access multiple columns
        Iloc is for integer indexing
    • Messed up as countN = (df.iloc[:,7:17].eq(2).sum(axis=1)) instead had countN = (df.iloc[:,7:17].eq(2)) == 1 since wanted to sum all the yes and no health conditions
    • It is important when learning a new language to figure out the proper document for the plots
    • If install a new library or add-in, rerun and close the Jupyter Notebook because it won't work if so

    Overall Thoughts

    It was really fun learning a new library in python. It challenged me a lot since it has been a while since I had coded in python -- but it was quick for me to pick it up and think a way to opimtize the code faster instead of having too many iterations. I hope to spend more time learning about Pandas and its capabilities to process data a bit more since it has only been one day since I heard of the library. The bar chart wasn't showing up. I was up late trying to figure it out, but went to sleep and figured it out in the morning.

    Moral of the story: “Pandas...Not just an animal. Sleep is the best debug. :D”